Abstract - We analyze an online learning algorithm that adaptively combines the outputs of two constituent algorithms (or experts) running in parallel to model an unknown desired signal. This online learning algorithm is shown to achieve (and in some cases outperform) the mean-square error (MSE) performance of the best constituent algorithm in the mixture in the steady state. However, the MSE analysis of this algorithm in the literature uses approximations and relies on statistical models of the underlying signals and systems. Hence, such an analysis may not be useful or valid for signals generated by various real-life systems that show high degrees of nonstationarity, limit cycles and, in many cases, are even chaotic. In this paper, we produce results in an individual sequence manner. In particular, we relate the time-accumulated squared estimation error of this online algorithm at any time over any interval to the time-accumulated squared estimation error of the optimal convex mixture of the constituent algorithms directly tuned to the underlying signal, in a deterministic sense and without any statistical assumptions. In this sense, our analysis provides the transient, steady-state and tracking behavior of this algorithm without any approximations in the derivations or statistical assumptions on the underlying signals, so that our results are guaranteed to hold. We illustrate the introduced results through examples.

Index Terms - Learning algorithms, mixture of experts, deterministic, convexly constrained, steady-state, transient, tracking.



I. Introduction 

The problem of estimating or learning an unknown desired signal is heavily investigated in the online learning [1]-[7] and adaptive signal processing [8]-[11] literature. However, in various applications, certain difficulties arise in the estimation process due to the lack of structural and statistical information about the data model. To resolve this lack of information, mixture approaches are proposed that adaptively combine outputs of multiple constituent algorithms performing the same task, in the online learning literature under the mixture of experts framework [5]-[7] and in adaptive signal processing under the adaptive mixture methods framework [8]-[10]. These parallel running algorithms can be seen as alternative hypotheses for modeling, which can be exploited for both performance improvement and robustness. Along these lines, an online convexly constrained mixture method that combines outputs of two learning algorithms is introduced in [9]. In this approach, the outputs of the constituent algorithms that run in parallel on the same task are adaptively combined under a convex constraint to minimize the final MSE. This adaptive mixture is shown to be universal with respect to the input algorithms in a certain stochastic sense such that this mixture achieves (and in some cases outperforms) the MSE performance of the best constituent algorithm in the mixture in the steady state [9]. However, the MSE analysis of this adaptive mixture for the steady state and during the transient regions uses approximations, e.g., separation assumptions, and relies on statistical models of the signals and systems, e.g., stationary data models [9], [10]. In this paper, we study this algorithm from the perspective of online learning and produce results in an individual sequence manner such that our results are guaranteed to hold for any bounded arbitrary signal.

Nevertheless, signals produced by various real-life systems, such as in underwater acoustic communication applications, show high degrees of nonstationarity, limit cycles and, in many cases, are even chaotic, so that they hardly fit assumed statistical models [12]. Hence, an analysis based on certain statistical assumptions or approximations may not be useful or adequate under these conditions. To this end, we refrain from making any statistical assumptions on the underlying signals and present an analysis that is guaranteed to hold for any bounded arbitrary signal without any approximations. In particular, we relate the performance of this learning algorithm that adaptively combines outputs of two constituent algorithms to the performance of the optimal convex combination that is directly tuned to the underlying signal and outputs of the constituent algorithms in a deterministic sense. Naturally, this optimal convex combination can only be chosen in hindsight, since selecting it requires observing the whole signal and the constituent outputs in advance (before we even start processing the data). Since we compare the performance of this algorithm with respect to the best convex combination of the constituent filters in a deterministic sense over any time interval, our analysis provides, without any assumptions, the transient, the tracking and the steady-state behaviors together [5]-[7]. In particular, if the analysis window starts from $t = 1$, then we obtain the transient behavior; if the window length goes to infinity, then we obtain the steady-state behavior; and finally, if the analysis window is selected arbitrarily, then we get the tracking behavior, as explained in detail in Section III. The corresponding bounds may also hold for unbounded signals, such as those with Gaussian and Laplacian distributions, if one can define reasonable bounds such that the effect of samples of the desired signal that are outside of an interval on the cumulative loss diminishes as the data size increases, as demonstrated in Section III.

After we provide a brief system description in Section II, we present a deterministic analysis of the convexly constrained mixture algorithm in Section III, where the performance bounds are given as a theorem and a lemma. We illustrate the introduced results through examples in Section IV. The paper concludes with certain remarks.

The Convexly Constrained Algorithm:

Parameters:
    $\mu > 0$: learning rate.
Inputs:
    $y_t$: desired signal.
    $\hat{y}_{1,t}, \hat{y}_{2,t}$: constituent learning algorithms.
Outputs:
    $\hat{y}_t$: estimate of the desired signal.

Initialization: Set the initial weights $\lambda_1 = 1/2$ and $\rho_1 = 0$.

for $t = 1, \ldots, n$,
    % receive the constituent algorithm outputs $\hat{y}_{1,t}$ and $\hat{y}_{2,t}$ and
    % estimate the desired signal
    $\hat{y}_t = \lambda_t \hat{y}_{1,t} + (1 - \lambda_t) \hat{y}_{2,t}$
    % Upon receiving $y_t$, update the weight according to the rule:
    $\rho_{t+1} = \rho_t + \mu e_t \lambda_t (1 - \lambda_t) [\hat{y}_{1,t} - \hat{y}_{2,t}]$
    $\lambda_{t+1} = \frac{1}{1 + e^{-\rho_{t+1}}}$
endfor

TABLE I: The learning algorithm that adaptively combines outputs of two algorithms.

II. Problem Description 

In this framework, we have a desired signal $\{y_t\}_{t \ge 1}$, where $|y_t| \le Y < \infty$, and two constituent algorithms running in parallel producing $\{\hat{y}_{1,t}\}_{t \ge 1}$ and $\{\hat{y}_{2,t}\}_{t \ge 1}$, respectively, as the estimations (or predictions) of the desired signal $\{y_t\}_{t \ge 1}$. We assume that $Y$ is known. Here, we have no restrictions on $\hat{y}_{1,t}$ or $\hat{y}_{2,t}$, e.g., these outputs are not required to be causal; however, without loss of generality, we assume $|\hat{y}_{1,t}| \le Y$ and $|\hat{y}_{2,t}| \le Y$, i.e., these outputs can be clipped to the range $[-Y, Y]$ without sacrificing performance under the squared error. As an example, the desired signal and outputs of the constituent learning algorithms can be single realizations generated under the framework of [9]. At each time $t$, the convexly constrained algorithm receives an input vector $\mathbf{x}_t \triangleq [\hat{y}_{1,t} \ \hat{y}_{2,t}]^T$ and outputs

$$\hat{y}_t = \lambda_t \hat{y}_{1,t} + (1 - \lambda_t) \hat{y}_{2,t} = \mathbf{w}_t^T \mathbf{x}_t,$$

where $\mathbf{w}_t \triangleq [\lambda_t \ (1 - \lambda_t)]^T$, $0 \le \lambda_t \le 1$, as the final estimate. The final estimation error is given by $e_t \triangleq y_t - \hat{y}_t$.

The combination weight $\lambda_t$ is trained through an auxiliary variable using a stochastic gradient update to minimize the squared final estimation error as

$$\lambda_t = \frac{1}{1 + e^{-\rho_t}}, \tag{1}$$

$$\rho_{t+1} = \rho_t - \frac{\mu}{2} \nabla_{\rho} e_t^2 \Big|_{\rho = \rho_t} = \rho_t + \mu e_t \lambda_t (1 - \lambda_t) [\hat{y}_{1,t} - \hat{y}_{2,t}], \tag{2}$$

where $\mu > 0$ is the learning rate. The combination parameter $\lambda_t$ in (1) is constrained to lie in $[\lambda^+, (1 - \lambda^+)]$, $0 < \lambda^+ < 1/2$, in [9], since the update in (2) may slow down when $\lambda_t$ is too close to the boundaries. We follow the same restriction and analyze (2) under this constraint. The algorithm is presented in Table I.
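The recursion in (1) and (2) can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the function name `convex_mixture`, its argument names, and the way the constraint $\lambda_t \in [\lambda^+, 1 - \lambda^+]$ is enforced (by clipping the auxiliary variable $\rho_t$) are assumptions made here for concreteness.

```python
import math

def convex_mixture(y, y1, y2, mu, lam_plus=0.0):
    """Run the two-expert convex mixture; return (estimates, weights).

    y, y1, y2 : sequences of the desired signal and the two expert outputs.
    mu        : learning rate of update (2).
    lam_plus  : lower limit lambda^+ on the weight (0 disables clipping).
    """
    rho = 0.0  # rho_1 = 0, so lambda_1 = 1/2
    # Assumption: the constraint is imposed by clipping rho, since
    # lambda = 1 - lam_plus corresponds to rho = ln((1 - lam_plus) / lam_plus).
    rho_max = math.inf if lam_plus <= 0.0 else math.log((1.0 - lam_plus) / lam_plus)
    estimates, weights = [], []
    for yt, y1t, y2t in zip(y, y1, y2):
        lam = 1.0 / (1.0 + math.exp(-rho))        # eq. (1)
        y_hat = lam * y1t + (1.0 - lam) * y2t     # convex combination
        estimates.append(y_hat)
        weights.append(lam)
        e = yt - y_hat                            # final estimation error
        rho += mu * e * lam * (1.0 - lam) * (y1t - y2t)   # eq. (2)
        rho = max(-rho_max, min(rho_max, rho))    # keep lambda inside the limits
    return estimates, weights
```

Note that the weight used at time $t$ is computed before $y_t$ is revealed, matching the order of operations in Table I.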



Under the deterministic analysis framework, the performance of the algorithm is determined by the time-accumulated squared error [5], [7], [13]-[15]. When applied to any sequence $\{y_t\}_{t \ge 1}$, the algorithm of (1) yields the total accumulated loss

$$L_n(\hat{y}, y) = L_n(\mathbf{w}_t^T \mathbf{x}_t, y) \triangleq \sum_{t=1}^{n} (y_t - \hat{y}_t)^2 \tag{3}$$

for any $n$. We emphasize that for unbounded signals, such as those with Gaussian and Laplacian distributions, we can define a suitable $Y$ such that the samples of $y_t$ are inside of the interval $[-Y, Y]$ with high probability and the effect of the samples that are outside of this interval on the cumulative loss (3) diminishes as $n$ gets larger.
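As a small illustration of choosing $Y$ for an unbounded signal, the sketch below evaluates the probability that a zero-mean Gaussian sample falls outside $[-Y, Y]$. The helper name `gaussian_tail_mass` and the zero-mean Gaussian model are assumptions introduced only for this example; the paper does not prescribe a specific rule for picking $Y$.

```python
import math

def gaussian_tail_mass(Y, sigma=1.0):
    """P(|y| > Y) for y ~ N(0, sigma^2), using the complementary error function."""
    return math.erfc(Y / (sigma * math.sqrt(2.0)))

# Fraction of samples expected outside [-Y, Y] for a few candidate bounds.
for Y in (2.0, 3.0, 4.0):
    print(Y, gaussian_tail_mass(Y))
```

A larger $Y$ leaves fewer samples outside the interval, but $Y$ also enters the step size and the bound of the theorem in Section III, so the choice is a trade-off.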

We next provide deterministic bounds on $L_n(\hat{y}, y)$ with respect to the best convex combination $\min_{\beta \in [0,1]} L_n(\hat{y}_\beta, y)$, where

$$L_n(\hat{y}_\beta, y) = L_n(\mathbf{u}^T \mathbf{x}_t, y) \triangleq \sum_{t=1}^{n} (y_t - \hat{y}_{\beta,t})^2$$

and

$$\hat{y}_{\beta,t} \triangleq \beta \hat{y}_{1,t} + (1 - \beta) \hat{y}_{2,t} = \mathbf{u}^T \mathbf{x}_t,$$

$\mathbf{u} \triangleq [\beta \ \ 1 - \beta]^T$, that holds uniformly in an individual sequence manner without any stochastic assumptions on $y_t$, $\hat{y}_{1,t}$, $\hat{y}_{2,t}$ or $n$. Note that the best fixed convex combination parameter

$$\beta_o = \arg\min_{\beta \in [0,1]} L_n(\hat{y}_\beta, y)$$

and the corresponding estimator

$$\hat{y}_{\beta_o,t} = \beta_o \hat{y}_{1,t} + (1 - \beta_o) \hat{y}_{2,t},$$

which we compare the performance against, can only be determined after observing the entire sequences, i.e., $\{y_t\}$, $\{\hat{y}_{1,t}\}$ and $\{\hat{y}_{2,t}\}$, in advance for all $n$.
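Because $L_n(\hat{y}_\beta, y)$ is a one-dimensional convex quadratic in $\beta$, the hindsight weight $\beta_o$ can be computed by clipping the unconstrained least-squares solution to $[0, 1]$. The sketch below illustrates this; the function names `best_fixed_beta` and `accumulated_loss` are hypothetical, and the closed form is a standard least-squares fact rather than a formula stated in the paper.

```python
def best_fixed_beta(y, y1, y2):
    """Minimizer over [0, 1] of sum_t (y_t - beta*y1_t - (1 - beta)*y2_t)^2."""
    # Write the error as (y - y2) - beta*(y1 - y2) and solve the scalar
    # least-squares problem, then clip to the convex constraint.
    num = sum((a - q) * (p - q) for a, p, q in zip(y, y1, y2))
    den = sum((p - q) ** 2 for p, q in zip(y1, y2))
    if den == 0.0:
        return 0.5  # experts coincide; every beta gives the same loss
    return min(1.0, max(0.0, num / den))

def accumulated_loss(y, y_hat):
    """Time-accumulated squared error, as in eq. (3)."""
    return sum((a - b) ** 2 for a, b in zip(y, y_hat))
```

Clipping is valid here only because the problem is scalar and convex: the constrained minimizer of a convex quadratic on an interval is the projection of the unconstrained one.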

III. A Deterministic Analysis 

In this section, we first relate the accumulated loss of the mixture to the accumulated loss of the best convex combination that minimizes the accumulated loss in the following theorem. Then, we demonstrate that one cannot improve the convergence rate of this upper bound using our methodology directly and the Kullback-Leibler (KL) divergence [6] as the distance measure by providing counter examples as a lemma. The use of the KL divergence as a distance measure for obtaining worst-case loss bounds was pioneered by Littlestone [16], and later adopted extensively in the online learning literature [6], [7], [17]. We emphasize that although the steady-state and transient MSE performances of the convexly constrained mixture algorithm are analyzed with respect to the constituent learning algorithms in [9], [10], we perform the steady-state, transient and tracking analysis without any stochastic assumptions or use of any approximations in the following theorem.
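Theorem: The algorithm given in (2), when applied to any sequence $\{y_t\}_{t \ge 1}$, with $|y_t| \le Y < \infty$, yields, for any $n$ and $\epsilon > 0$,

$$L_n(\hat{y}, y) - \left(\frac{2\epsilon + 1}{1 - z^2}\right) \min_{\beta \in [0,1]} \{L_n(\hat{y}_\beta, y)\} \le O\left(\frac{1}{\epsilon}\right), \tag{4}$$

where $O(\cdot)$ is the order notation, $\hat{y}_{\beta,t} = \beta \hat{y}_{1,t} + (1 - \beta) \hat{y}_{2,t}$, $z \triangleq \frac{1 - 4\lambda^+(1 - \lambda^+)}{1 + 4\lambda^+(1 - \lambda^+)} < 1$ and step size $\mu = \frac{4\epsilon}{2\epsilon + 1} \frac{2 + 2z}{Y^2}$, provided that $\lambda_t \in [\lambda^+, 1 - \lambda^+]$ for all $t$ during the adaptation.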



This theorem provides a regret bound for the algorithm (2), showing that the cumulative loss of the convexly constrained algorithm is close to a factor times the cumulative loss of the algorithm with the best weight chosen in hindsight. If we define the regret

$$R_n \triangleq L_n(\hat{y}, y) - \left(\frac{2\epsilon + 1}{1 - z^2}\right) \min_{\beta \in [0,1]} \{L_n(\hat{y}_\beta, y)\}, \tag{5}$$

then equation (4) implies that the time-normalized regret

$$\frac{R_n}{n} \triangleq \frac{L_n(\hat{y}, y)}{n} - \left(\frac{2\epsilon + 1}{1 - z^2}\right) \min_{\beta \in [0,1]} \left\{\frac{L_n(\hat{y}_\beta, y)}{n}\right\}$$

converges to zero at a rate $O\left(\frac{1}{n\epsilon}\right)$ uniformly over the desired signal and the outputs of constituent algorithms. Moreover, (4) provides the exact trade-off between the transient and steady-state performances of the convex mixture in a deterministic sense without any assumptions or approximations. Note that (4) is guaranteed to hold independent of the initial condition of the combination weight $\lambda_t$ for any time interval in an individual sequence manner. Hence, (4) also provides the tracking performance of the convexly constrained algorithm in a deterministic sense. From (4), we observe that the convergence rate of the right hand side, i.e., the bound, is $O\left(\frac{1}{n}\right)$ after time normalization, and, as in the stochastic case [10], to get a tighter asymptotic bound with respect to the optimal convex combination of the learning algorithms, we require a smaller $\epsilon$, i.e., a smaller learning rate $\mu$, which increases the right hand side of (4). Although this result is well known in the adaptive filtering literature and appears widely in stochastic contexts, this trade-off is guaranteed to hold here without any statistical assumptions or approximations. Note that the optimal convex combination in (4), i.e., the minimizing $\beta$, depends on the entire signal and outputs of the constituent algorithms for all $n$ and hence it can only be determined in hindsight.
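To make the trade-off between $\epsilon$, the step size and the multiplicative factor concrete, the following sketch evaluates the quantities in the theorem statement for given $\lambda^+$, $\epsilon$ and $Y$. The function name `theorem_parameters` is hypothetical; the formulas are those for $z$, $\mu$ and the factor $(2\epsilon + 1)/(1 - z^2)$ as reconstructed from the theorem.

```python
def theorem_parameters(lam_plus, eps, Y):
    """Return (z, mu, factor) from the theorem for 0 < lam_plus < 1/2, eps > 0."""
    c = 4.0 * lam_plus * (1.0 - lam_plus)
    z = (1.0 - c) / (1.0 + c)
    mu = (4.0 * eps / (2.0 * eps + 1.0)) * (2.0 + 2.0 * z) / Y ** 2
    factor = (2.0 * eps + 1.0) / (1.0 - z ** 2)   # multiplies the best hindsight loss
    return z, mu, factor

# Smaller eps gives a smaller step size and a factor closer to 1 / (1 - z^2).
for eps in (1.0, 0.1, 0.01):
    print(eps, theorem_parameters(0.08, eps, 0.5))
```

As $\epsilon$ decreases, the factor in front of the best hindsight loss shrinks while the $O(1/\epsilon)$ term on the right hand side of the bound grows, which is the trade-off discussed above.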

Proof: To prove the theorem, we use the approach introduced in [7] (and later used in [6]) based on measuring progress of a mixture algorithm using certain distance measures.

We first convert (2) to a direct update on $\lambda_t$ and use this direct update in the proof. Using

$$e^{-\rho_t} = \frac{1 - \lambda_t}{\lambda_t}$$

from (1), the update in (2) can be written as

$$\lambda_{t+1} = \frac{1}{1 + e^{-\rho_{t+1}}} = \frac{1}{1 + e^{-\rho_t - \mu e_t \lambda_t (1 - \lambda_t)[\hat{y}_{1,t} - \hat{y}_{2,t}]}} = \frac{1}{1 + \frac{1 - \lambda_t}{\lambda_t} e^{-\mu e_t \lambda_t (1 - \lambda_t)[\hat{y}_{1,t} - \hat{y}_{2,t}]}}$$

$$= \frac{\lambda_t e^{\mu e_t \lambda_t (1 - \lambda_t) \hat{y}_{1,t}}}{\lambda_t e^{\mu e_t \lambda_t (1 - \lambda_t) \hat{y}_{1,t}} + (1 - \lambda_t) e^{\mu e_t \lambda_t (1 - \lambda_t) \hat{y}_{2,t}}}. \tag{6}$$

Unlike [6] (Lemma 5.8), our update in (6) has, in a certain sense, an adaptive learning rate $\mu \lambda_t (1 - \lambda_t)$, which requires a different formulation; however, the proof follows similar lines to [6] in certain parts.
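The equivalence between the sigmoid form (1)-(2) and the multiplicative form (6) is easy to check numerically. The sketch below compares one step of each; the function names are hypothetical and the check is only an illustration of the algebra above, not part of the proof.

```python
import math

def sigmoid_update(lam, mu, e, y1, y2):
    """One step of (2) followed by (1), starting from the current weight lam."""
    rho = math.log(lam / (1.0 - lam))                        # invert eq. (1)
    rho_next = rho + mu * e * lam * (1.0 - lam) * (y1 - y2)  # eq. (2)
    return 1.0 / (1.0 + math.exp(-rho_next))

def multiplicative_update(lam, mu, e, y1, y2):
    """One step of the direct update (6)."""
    g = mu * e * lam * (1.0 - lam)       # adaptive rate times the error
    w1 = lam * math.exp(g * y1)
    w2 = (1.0 - lam) * math.exp(g * y2)
    return w1 / (w1 + w2)
```

Both functions return the same $\lambda_{t+1}$ up to floating-point rounding for any $\lambda_t \in (0, 1)$.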

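Here, for a fixed $\beta \in [0, 1]$, we define an estimator

$$\hat{y}_{\beta,t} \triangleq \beta \hat{y}_{1,t} + (1 - \beta) \hat{y}_{2,t} = \mathbf{u}^T \mathbf{x}_t,$$

where $\beta \in [0, 1]$ and $\mathbf{u} \triangleq [\beta \ \ 1 - \beta]^T$. Defining

$$\zeta_t \triangleq e^{\mu e_t \lambda_t (1 - \lambda_t)},$$

we have from (6)

$$\beta \ln\left(\frac{\lambda_{t+1}}{\lambda_t}\right) + (1 - \beta) \ln\left(\frac{1 - \lambda_{t+1}}{1 - \lambda_t}\right) = \hat{y}_{\beta,t} \ln \zeta_t - \ln\left(\lambda_t \zeta_t^{\hat{y}_{1,t}} + (1 - \lambda_t) \zeta_t^{\hat{y}_{2,t}}\right). \tag{7}$$

Using the inequality

$$\alpha^x \le 1 - x(1 - \alpha)$$

for $\alpha \ge 0$ and $x \in [0, 1]$ from [7], we have

$$\zeta_t^{\hat{y}_{1,t}} = \left(\zeta_t^{2Y}\right)^{\frac{\hat{y}_{1,t} + Y}{2Y}} \zeta_t^{-Y} \le \zeta_t^{-Y} \left(1 - \frac{\hat{y}_{1,t} + Y}{2Y} \left(1 - \zeta_t^{2Y}\right)\right),$$

which implies in (7)

$$\ln\left(\lambda_t \zeta_t^{\hat{y}_{1,t}} + (1 - \lambda_t) \zeta_t^{\hat{y}_{2,t}}\right) \le \ln\left(\zeta_t^{-Y} \left(1 - \frac{\lambda_t \hat{y}_{1,t} + (1 - \lambda_t) \hat{y}_{2,t} + Y}{2Y} \left(1 - \zeta_t^{2Y}\right)\right)\right)$$

$$= -Y \ln \zeta_t + \ln\left(1 - \frac{\hat{y}_t + Y}{2Y} \left(1 - \zeta_t^{2Y}\right)\right), \tag{8}$$

where $\hat{y}_t = \lambda_t \hat{y}_{1,t} + (1 - \lambda_t) \hat{y}_{2,t}$. As in [6], one can further bound (8) using

$$\ln\left(1 - q(1 - e^p)\right) \le pq + \frac{p^2}{8}$$

for $0 \le q \le 1$ (originally from [7]):

$$\ln\left(\lambda_t \zeta_t^{\hat{y}_{1,t}} + (1 - \lambda_t) \zeta_t^{\hat{y}_{2,t}}\right) \le -Y \ln \zeta_t + (\hat{y}_t + Y) \ln \zeta_t + \frac{Y^2 (\ln \zeta_t)^2}{2}. \tag{9}$$

Using (9) in (7) yields

$$\beta \ln\left(\frac{\lambda_{t+1}}{\lambda_t}\right) + (1 - \beta) \ln\left(\frac{1 - \lambda_{t+1}}{1 - \lambda_t}\right) \ge (\hat{y}_{\beta,t} + Y) \ln \zeta_t - (\hat{y}_t + Y) \ln \zeta_t - \frac{Y^2 (\ln \zeta_t)^2}{2}. \tag{10}$$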

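At each adaptation, the progress made by the algorithm towards $\mathbf{u}$ at time $t$ is measured as $D(\mathbf{u} \| \mathbf{w}_t) - D(\mathbf{u} \| \mathbf{w}_{t+1})$, where $\mathbf{w}_t \triangleq [\lambda_t \ (1 - \lambda_t)]^T$ and

$$D(\mathbf{u} \| \mathbf{w}) \triangleq \sum_{i=1}^{2} u_i \ln(u_i / w_i)$$

is the KL divergence [18], $\mathbf{u} \in [0, 1]^2$, $\mathbf{w} \in [0, 1]^2$. We require that this progress is at least $a(y_t - \hat{y}_t)^2 - b(y_t - \hat{y}_{\beta,t})^2$ for certain $a$, $b$, $\mu$ [6], [7], i.e.,

$$a(y_t - \hat{y}_t)^2 - b(y_t - \hat{y}_{\beta,t})^2 \le D(\mathbf{u} \| \mathbf{w}_t) - D(\mathbf{u} \| \mathbf{w}_{t+1}) = \beta \ln\left(\frac{\lambda_{t+1}}{\lambda_t}\right) + (1 - \beta) \ln\left(\frac{1 - \lambda_{t+1}}{1 - \lambda_t}\right), \tag{11}$$

which yields the desired deterministic bound in (4) after telescoping.

In information theory and probability theory, the KL divergence, which is also known as the relative entropy, is empirically shown to be an efficient measure of the distance between two probability vectors [6], [7], [18]. Here, the vectors $\mathbf{u}$ and $\mathbf{w}_t$ are probability vectors, i.e., $\mathbf{u}, \mathbf{w}_t \in [0, 1]^2$ and $\mathbf{u}^T \mathbf{1} = \mathbf{w}_t^T \mathbf{1} = 1$, where $\mathbf{1} \triangleq [1 \ 1]^T$. This use of KL divergence as a distance measure between weight vectors is widespread in the online learning literature [6], [7], [17].

We observe from (10) and (11) that to prove the theorem, it is sufficient to show that $G(y_t, \hat{y}_t, \hat{y}_{\beta,t}, \zeta_t) \le 0$, where

$$G(y_t, \hat{y}_t, \hat{y}_{\beta,t}, \zeta_t) \triangleq -(\hat{y}_{\beta,t} + Y) \ln \zeta_t + (\hat{y}_t + Y) \ln \zeta_t + \frac{Y^2 (\ln \zeta_t)^2}{2} + a(y_t - \hat{y}_t)^2 - b(y_t - \hat{y}_{\beta,t})^2. \tag{12}$$

For fixed $y_t$, $\hat{y}_t$, $\zeta_t$, $G(y_t, \hat{y}_t, \hat{y}_{\beta,t}, \zeta_t)$ is maximized when $\frac{\partial G}{\partial \hat{y}_{\beta,t}} = 0$, i.e.,

$$\hat{y}_{\beta,t} - y_t + \frac{\ln \zeta_t}{2b} = 0,$$

since $\frac{\partial^2 G}{\partial \hat{y}_{\beta,t}^2} = -2b < 0$, yielding $\hat{y}^*_{\beta,t} = y_t - \frac{\ln \zeta_t}{2b}$. Note that while taking the partial derivative of $G(\cdot)$ with respect to $\hat{y}_{\beta,t}$ and finding $\hat{y}^*_{\beta,t}$, we assume that all of $y_t$, $\hat{y}_t$, $\zeta_t$ are fixed, i.e., their partial derivatives with respect to $\hat{y}_{\beta,t}$ are zero. This yields an upper bound on $G(\cdot)$ in terms of $\hat{y}^*_{\beta,t}$. Hence, it is sufficient to show that $G(y_t, \hat{y}_t, \hat{y}^*_{\beta,t}, \zeta_t) \le 0$, where

$$G(y_t, \hat{y}_t, \hat{y}^*_{\beta,t}, \zeta_t) = -\left(y_t + Y - \frac{\ln \zeta_t}{2b}\right) \ln \zeta_t + (\hat{y}_t + Y) \ln \zeta_t + \frac{Y^2 (\ln \zeta_t)^2}{2} + a(y_t - \hat{y}_t)^2 - \frac{(\ln \zeta_t)^2}{4b}$$

$$= a(y_t - \hat{y}_t)^2 - (y_t - \hat{y}_t) \ln \zeta_t + \frac{(\ln \zeta_t)^2}{4b} + \frac{Y^2 (\ln \zeta_t)^2}{2} \tag{13}$$

$$= (y_t - \hat{y}_t)^2 \left[a - \mu \lambda_t (1 - \lambda_t) + \frac{\mu^2 \lambda_t^2 (1 - \lambda_t)^2}{4b} + \frac{Y^2 \mu^2 \lambda_t^2 (1 - \lambda_t)^2}{2}\right]. \tag{14}$$

For (14) to be negative, defining $k \triangleq \lambda_t (1 - \lambda_t)$ and

$$H(k) \triangleq k^2 \mu^2 \left(\frac{Y^2}{2} + \frac{1}{4b}\right) - \mu k + a,$$

it is sufficient to show that $H(k) \le 0$ for $k \in [\lambda^+(1 - \lambda^+), \frac{1}{4}]$, i.e., $k \in [\lambda^+(1 - \lambda^+), \frac{1}{4}]$ when $\lambda_t \in [\lambda^+, (1 - \lambda^+)]$, since $H(k)$ is a convex quadratic function of $k$, i.e., $\frac{\partial^2 H}{\partial k^2} > 0$. Hence, we require that the interval where the function $H(\cdot)$ is negative should include $[\lambda^+(1 - \lambda^+), \frac{1}{4}]$, i.e., the roots $k_1$ and $k_2$ (where $k_2 \le k_1$) of $H(\cdot)$ should satisfy

$$k_1 \ge \frac{1}{4}, \quad k_2 \le \lambda^+(1 - \lambda^+),$$

where

$$k_1 = \frac{\mu + \sqrt{\mu^2 - 4\mu^2 a \left(\frac{Y^2}{2} + \frac{1}{4b}\right)}}{2\mu^2 \left(\frac{Y^2}{2} + \frac{1}{4b}\right)} = \frac{1 + \sqrt{1 - 4as}}{2\mu s}, \tag{15}$$

$$k_2 = \frac{\mu - \sqrt{\mu^2 - 4\mu^2 a \left(\frac{Y^2}{2} + \frac{1}{4b}\right)}}{2\mu^2 \left(\frac{Y^2}{2} + \frac{1}{4b}\right)} = \frac{1 - \sqrt{1 - 4as}}{2\mu s} \tag{16}$$

and

$$s \triangleq \frac{Y^2}{2} + \frac{1}{4b}.$$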
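To satisfy $k_1 \ge 1/4$, we straightforwardly require from (15)

$$\frac{2 + 2\sqrt{1 - 4as}}{s} \ge \mu.$$

To get the tightest upper bound for (16), we set

$$\mu = \frac{2 + 2\sqrt{1 - 4as}}{s},$$

i.e., the largest allowable learning rate.

To have $k_2 \le \lambda^+(1 - \lambda^+)$ with $\mu = \frac{2 + 2\sqrt{1 - 4as}}{s}$, from (16) we require

$$\frac{1 - \sqrt{1 - 4as}}{4(1 + \sqrt{1 - 4as})} \le \lambda^+(1 - \lambda^+). \tag{17}$$

Equation (17) yields

$$as = a\left(\frac{Y^2}{2} + \frac{1}{4b}\right) \le \frac{1 - z^2}{4}, \tag{18}$$

where

$$z \triangleq \frac{1 - 4\lambda^+(1 - \lambda^+)}{1 + 4\lambda^+(1 - \lambda^+)}$$

and $z < 1$ after some algebra.

To satisfy (18), we set $b = \frac{\epsilon}{Y^2}$ for any (or arbitrarily small) $\epsilon > 0$, which results in

$$a \le \frac{(1 - z^2)\epsilon}{Y^2 (2\epsilon + 1)}. \tag{19}$$

To get the tightest bound in (11), we select

$$a = \frac{(1 - z^2)\epsilon}{Y^2 (2\epsilon + 1)}$$

in (19). Such selection of $a$, $b$ and $\mu$ results in (11)

$$\frac{(1 - z^2)\epsilon}{Y^2 (2\epsilon + 1)} (y_t - \hat{y}_t)^2 - \frac{\epsilon}{Y^2} (y_t - \hat{y}_{\beta,t})^2 \le \beta \ln\left(\frac{\lambda_{t+1}}{\lambda_t}\right) + (1 - \beta) \ln\left(\frac{1 - \lambda_{t+1}}{1 - \lambda_t}\right). \tag{20}$$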
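The per-step inequality (20) can be checked numerically for random inputs. The sketch below computes the right side of (20) minus its left side for one unclipped update of (6), with $a$, $b$ and $\mu$ chosen as in the proof. The function name `step_gap` and the random test harness are assumptions made for this illustration.

```python
import math
import random

def step_gap(y, y1, y2, lam, beta, lam_plus, eps, Y):
    """Right side of (20) minus left side for a single update step."""
    c = 4.0 * lam_plus * (1.0 - lam_plus)
    z = (1.0 - c) / (1.0 + c)
    b = eps / Y ** 2
    a = (1.0 - z ** 2) * eps / (Y ** 2 * (2.0 * eps + 1.0))
    mu = (4.0 * eps / (2.0 * eps + 1.0)) * (2.0 + 2.0 * z) / Y ** 2
    y_hat = lam * y1 + (1.0 - lam) * y2
    y_beta = beta * y1 + (1.0 - beta) * y2
    e = y - y_hat
    g = mu * e * lam * (1.0 - lam)            # equals ln(zeta_t)
    w1 = lam * math.exp(g * y1)
    w2 = (1.0 - lam) * math.exp(g * y2)
    lam_next = w1 / (w1 + w2)                 # eq. (6)
    progress = (beta * math.log(lam_next / lam)
                + (1.0 - beta) * math.log((1.0 - lam_next) / (1.0 - lam)))
    return progress - (a * e ** 2 - b * (y - y_beta) ** 2)

random.seed(1)
Y, lam_plus, eps = 1.0, 0.1, 0.5
worst = min(
    step_gap(random.uniform(-Y, Y), random.uniform(-Y, Y), random.uniform(-Y, Y),
             random.uniform(lam_plus, 1.0 - lam_plus), random.random(),
             lam_plus, eps, Y)
    for _ in range(20000)
)
print(worst)  # expected to be nonnegative up to rounding
```

The check draws $\lambda_t$ from $[\lambda^+, 1 - \lambda^+]$ because the derivation of (20) relies on $k = \lambda_t(1 - \lambda_t)$ lying between the two roots of $H(\cdot)$.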
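After telescoping, i.e., summation over $t$, $\sum_{t=1}^{n}$, (20) yields

$$a L_n(\hat{y}, y) - b \min_{\beta \in [0,1]} \{L_n(\hat{y}_\beta, y)\} \le \beta \ln\left(\frac{\lambda_{n+1}}{\lambda_1}\right) + (1 - \beta) \ln\left(\frac{1 - \lambda_{n+1}}{1 - \lambda_1}\right) \le O(1), \tag{21}$$

so that

$$\frac{(1 - z^2)\epsilon}{Y^2 (2\epsilon + 1)} L_n(\hat{y}, y) - \frac{\epsilon}{Y^2} \min_{\beta \in [0,1]} \{L_n(\hat{y}_\beta, y)\} \le O(1).$$

Hence, it follows that

$$L_n(\hat{y}, y) - \left(\frac{2\epsilon + 1}{1 - z^2}\right) \min_{\beta \in [0,1]} \{L_n(\hat{y}_\beta, y)\} \le \frac{(2\epsilon + 1) Y^2}{\epsilon (1 - z^2)} O(1) \le O\left(\frac{1}{\epsilon}\right), \tag{22}$$

which is the desired bound. Note that using

$$b = \frac{\epsilon}{Y^2}, \quad a = \frac{(1 - z^2)\epsilon}{Y^2 (2\epsilon + 1)}, \tag{23}$$

we get

$$\mu = \frac{2 + 2\sqrt{1 - 4as}}{s} = \frac{2 + 2z}{\frac{Y^2}{2} + \frac{1}{4b}} = \frac{4\epsilon}{2\epsilon + 1} \frac{2 + 2z}{Y^2}, \tag{24}$$

after some algebra, as in the statement of the theorem. This concludes the proof of the theorem. □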

In the following lemma, we show that the order of the upper 
bound using the KL divergence as the distance measure under 
the same methodology cannot be improved by presenting an 
example in which the bound on b is of the same order as that 
given in the theorem. 

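Lemma: For positive real constants $a$, $b$ and $\mu$ that satisfy (11) for all $|y_t| \le Y$, $|\hat{y}_{1,t}| \le Y$ and $|\hat{y}_{2,t}| \le Y$ and $\lambda_t \in [\lambda^+, (1 - \lambda^+)]$, we require

$$b \ge 4a + \frac{a}{4\lambda^+(1 - \lambda^+)}.$$

Proof: Since the inequality in (11) should be satisfied for all possible $y_t$, $\hat{y}_{1,t}$, $\hat{y}_{2,t}$, $\beta$ and $\lambda_t$, the proper values of $a$, $b$ and $\mu$ should satisfy (11) for any particular selection of $y_t$, $\hat{y}_{1,t}$, $\hat{y}_{2,t}$, $\beta$ and $\lambda_t$. First we consider

$$y_t = \hat{y}_{1,t} = Y, \quad \hat{y}_{2,t} = 0, \quad \beta = 1, \quad \lambda_t = \lambda^+,$$

(or, similarly, $y_t = \hat{y}_{1,t} = Y$, $\hat{y}_{2,t} = -Y$ and $\lambda_t = \lambda^+$). In this case, we have

$$a(Y - \lambda^+ Y)^2 \le -\ln\left(\lambda^+ + (1 - \lambda^+) e^{\mu (Y - \lambda^+ Y) \lambda^+ (1 - \lambda^+)(-Y)}\right)$$

$$\le -\lambda^+ \ln 1 - \mu (1 - \lambda^+)^2 \lambda^+ Y (1 - \lambda^+)(-Y) \tag{25}$$

$$= \mu (1 - \lambda^+)^3 \lambda^+ Y^2, \tag{26}$$

where (25) follows from Jensen's inequality for the concave function $\ln(\cdot)$. By (26), we have

$$\mu \ge \frac{a}{\lambda^+(1 - \lambda^+)}. \tag{27}$$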

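For another particular case where

$$y_t = -Y/2, \quad \hat{y}_{1,t} = 0, \quad \hat{y}_{2,t} = Y, \quad \beta = 1, \quad \lambda_t = 1/2,$$

we have

$$a(-Y)^2 - b\left(-\frac{Y}{2}\right)^2 \le -\ln\left(\frac{1}{2} + \frac{1}{2} e^{\mu (-Y) \frac{1}{4} (-Y)}\right) \le -\frac{1}{2} \frac{\mu Y^2}{4}, \tag{28}$$

where (28) also follows from Jensen's inequality. By (28), we have

$$b \ge 4a + \frac{\mu}{4} \ge 4a + \frac{a}{4\lambda^+(1 - \lambda^+)}, \tag{29}$$

where (29) follows from (27), which finalizes the proof. □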

IV. Simulations 

In this section, we illustrate the performance of the learning algorithm (2) and the introduced results through examples. We demonstrate that the upper bound given in (4) is asymptotically tight by providing specific sequences for the desired signal $y_t$ and the outputs of constituent algorithms $\hat{y}_{1,t}$ and $\hat{y}_{2,t}$. We also demonstrate that to get a tighter asymptotic bound, we require a smaller learning rate $\mu$, as suggested by our theoretical analysis.

In the first case, we present the regret of the learning algorithm (2) defined in (5) and the corresponding upper bound given in (4). We first set $Y = 0.5$, $\lambda^+ = 0.08$ and $\mu = 0.08$. Here, the desired signal is given by

$$y_t = Y$$

for $t = 1, \ldots, 10000$. For this specific example, the parallel running constituent algorithms produce the sequences

$$\hat{y}_{1,t} = Y, \quad \hat{y}_{2,t} = (-1)^t Y$$

for $t = 1, \ldots, 10000$. Note that, in this case, the best convex combination weight is $\beta_o = 1$ and the cumulative loss of the best convex combination is 0 since $y_t$ and $\hat{y}_{1,t}$ are identical. In Fig. 1a, we plot the time-normalized regret of the learning algorithm (2) "Time-normalized regret, $\mu_1 = 0.08$" and the upper bound given in (4) "$O(1/(n\epsilon_1))$". From Fig. 1a, we observe that the bound introduced in (4) is asymptotically tight, i.e., as $n$ gets larger, the gap between the upper bound and the time-normalized regret gets smaller.

In the second case, we set $Y = 0.54$, $\lambda^+ = 0.08$ and $\mu = 0.04$. Here, the desired signal is given by

$$y_t = 0.5$$

for $t = 1, \ldots, 10000$. For this example, the constituent algorithms produce the sequences

$$\hat{y}_{1,t} = Y, \quad \hat{y}_{2,t} = (-1)^t 0.5$$

for $t = 1, \ldots, 10000$. In this case, the best convex combination weight is $\beta_o = 0.96$; however, unlike the first case, the cumulative loss of the best convex combination is nonzero.
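The two example configurations can be reproduced approximately with the sketch below, which runs the update (1)-(2) and tracks a time-normalized regret in the sense of (5). Several choices here are assumptions rather than details given in the text: the function name `simulate`, enforcing $\lambda_t \in [\lambda^+, 1 - \lambda^+]$ by clipping $\rho_t$, recovering $\epsilon$ from $\mu$ by inverting the step-size formula of the theorem, and using the single weight $\beta_o$ fitted on the full run for every $n$.

```python
import math

def simulate(y, y1, y2, mu, lam_plus, Y):
    """Return (time-normalized regret curve, best fixed weight) for one run."""
    c = 4.0 * lam_plus * (1.0 - lam_plus)
    z = (1.0 - c) / (1.0 + c)
    m = mu * Y ** 2 / (2.0 + 2.0 * z)
    eps = m / (4.0 - 2.0 * m)       # inverts mu = 4*eps/(2*eps+1) * (2+2z)/Y^2
    factor = (2.0 * eps + 1.0) / (1.0 - z ** 2)
    # Hindsight weight over the whole run, clipped to [0, 1].
    num = sum((a - q) * (p - q) for a, p, q in zip(y, y1, y2))
    den = sum((p - q) ** 2 for p, q in zip(y1, y2))
    beta = min(1.0, max(0.0, num / den)) if den > 0.0 else 0.5
    rho, rho_max = 0.0, math.log((1.0 - lam_plus) / lam_plus)
    loss = loss_beta = 0.0
    curve = []
    for n, (a, p, q) in enumerate(zip(y, y1, y2), start=1):
        lam = 1.0 / (1.0 + math.exp(-rho))                 # eq. (1)
        e = a - (lam * p + (1.0 - lam) * q)
        loss += e ** 2
        loss_beta += (a - (beta * p + (1.0 - beta) * q)) ** 2
        curve.append((loss - factor * loss_beta) / n)      # R_n / n with fixed beta
        rho += mu * e * lam * (1.0 - lam) * (p - q)        # eq. (2)
        rho = max(-rho_max, min(rho_max, rho))             # assumed clipping rule
    return curve, beta

n = 10000
alt = [0.5 * (-1) ** k for k in range(1, n + 1)]
curve1, beta1 = simulate([0.5] * n, [0.5] * n, alt, 0.08, 0.08, 0.5)    # first case
curve2, beta2 = simulate([0.5] * n, [0.54] * n, alt, 0.04, 0.08, 0.54)  # second case
print(beta1, beta2, curve1[-1], curve2[-1])
```

The hindsight weights computed this way agree with the values quoted in the text ($\beta_o = 1$ in the first case and $\beta_o \approx 0.96$ in the second).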



In Fig. 1b, we plot the time-normalized regret of the learning algorithm (2) "Time-normalized regret, $\mu_2 = 0.04$" and the corresponding upper bound given in (4) "$O(1/(n\epsilon_2))$" for this example. We observe from Fig. 1b that the bound introduced in (4) is asymptotically tight. We also observe that, in this case, the upper bound is tighter compared to the first case since the learning rate, and consequently the parameter $\epsilon$, is smaller, as suggested by our theoretical results.

Fig. 1: Tightness of the regret bound. (a) $\mu_1 = 0.08$. (b) $\mu_2 = 0.04$.

In this section, we illustrated our theoretical results and the performance of the learning algorithm (2) through examples. We observed that the upper bound given in (4) is asymptotically tight by presenting two different examples, i.e., two different cases for the desired signal $y_t$ and the outputs of constituent algorithms $\hat{y}_{1,t}$ and $\hat{y}_{2,t}$. We also observed that to get a tighter asymptotic bound, we require a smaller learning rate $\mu$, as suggested by the results introduced in Section III.

V. Conclusion

In this paper, we analyze a learning algorithm [9] that adaptively combines outputs of two constituent algorithms running in parallel to model an unknown desired signal from the perspective of online learning theory and produce results in an individual sequence manner such that our results are guaranteed to hold for any bounded arbitrary signal. We relate the time-accumulated squared estimation error of this algorithm at any time to the time-accumulated squared estimation error of the optimal convex combination of the constituent algorithms that can only be chosen in hindsight. We refrain from making statistical assumptions on the underlying signals and our results are guaranteed to hold in an individual sequence manner. We also demonstrate that the proof methodology cannot be changed directly to obtain a better bound, in the convergence rate, on the performance by providing counter examples. To this end, we provide the transient, steady-state and tracking analysis of this mixture in a deterministic sense without any assumptions on the underlying signals or any approximations in the derivations. We illustrate the introduced results through examples.



